ABSTRACT


Recent advances in natural language processing, primarily propelled by Large Language Models
(LLMs), have showcased their remarkable capabilities grounded in in-context learning. A promising
avenue for guiding LLMs in intricate reasoning tasks is the use of intermediate reasoning steps
within the Chain-of-Thought (CoT) paradigm. Nevertheless, the central challenge lies in the
effective selection of exemplars for in-context learning. In this study, we introduce a framework
that leverages Dual Queries and Low-rank approximation Re-ranking (DQ-LoRe) to automatically
select exemplars for in-context learning. The dual queries first query the LLM to obtain
LLM-generated knowledge such as a CoT, then query the retriever to obtain the final exemplars
using both the question and that knowledge. Moreover, for the second query, LoRe employs
dimensionality reduction to refine exemplar selection, ensuring close alignment with the input
question's knowledge. Through extensive experiments, we demonstrate that DQ-LoRe significantly
outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4,
improving accuracy on GSM8K from 92.5% to 94.2%. Our comprehensive analysis further reveals that
DQ-LoRe consistently outperforms retrieval-based approaches in terms of both performance and
adaptability, especially in scenarios characterized by distribution shifts. DQ-LoRe pushes the
boundaries of in-context learning and opens up new avenues for addressing complex reasoning
challenges.


1 INTRODUCTION 


Recently, significant advancements in natural language processing (NLP) have been driven by large
language models (LLMs) (Chen et al., 2021; Chowdhery et al., 2022; Ouyang et al., 2022; Touvron
et al., 2023a;b; OpenAI, 2023). With the increasing capabilities of LLMs, in-context learning
(ICL) has emerged as a new paradigm, where LLMs make predictions based on contexts augmented with
a few exemplars (Brown et al., 2020). An important question in the field of in-context learning is
how to improve the selection of in-context exemplars to enhance the performance of LLMs (Liu et
al., 2022).


Selecting exemplars for ICL poses challenges due to their instability (Zhao et al., 2021). Even
minor changes in the order of samples within exemplars can affect the output (Lu et al., 2022).
The selection of exemplars for LLMs is currently a community-wide trial-and-error effort, as it is
difficult to extract generalizable regularities from empirical observations to form effective
selection criteria (Fu et al., 2022). One exception is retrieval-based exemplar acquisition
methods (Rubin et al., 2021; Ye et al., 2023; Li et al., 2023), where a retriever is used to
select similar exemplars based on input questions during inference.


However, these methods primarily focus on the similarity between input questions and examples in
the training set, without fully exploiting the relationship between the intermediate reasoning
steps of the given question and those of other examples in the pool. Previous studies have shown
that considering such chains of thought (CoT) can further improve the performance of LLMs on
multi-step reasoning tasks (Wei et al., 2022b). Furthermore, some work has considered the
interaction between Large Language Models and information retrieval systems (Xu et al., 2023).
Additionally, we observed that prior efforts struggled to distinguish exemplars in
high-dimensional embedding spaces. These observations suggest that exemplar selection based solely
on trained question embeddings may suffer from redundant information within the "universal"
representations, and may not effectively capture inherent relevance. Removing this redundant
information often leads to improved speed and effectiveness (Wang et al., 2023b). The sentence
embeddings of the retrieved exemplars often contain similar information, and employing Principal
Component Analysis (PCA; Wold et al., 1987) for dimensionality reduction can help filter out
redundant information and distinguish different exemplars.


To address these challenges, we propose a framework that leverages Dual Queries and Low-rank
approximation Re-ranking (DQ-LoRe) to incorporate CoTs beyond the input questions, improving the
exemplar selection process for in-context learning. DQ-LoRe first queries the LLM to generate a
CoT for a given question. We then concatenate this CoT with the question to query the retriever
and obtain exemplars from the training pool that share a similar reasoning process. Finally, we
apply PCA for dimensionality reduction to filter out redundant information and differentiate
between exemplars, further improving the selection process.


We conducted extensive experiments on various multi-step reasoning benchmarks to evaluate the
performance of DQ-LoRe. The results demonstrate that DQ-LoRe effectively and efficiently selects
exemplars, outperforming existing methods. Furthermore, DQ-LoRe exhibits robustness and
adaptability in the distribution shift setting, highlighting its versatility across different
scenarios. These findings have implications for the use of low-rank constraints in the LLM
paradigm. Our contributions can be summarized as follows:


• We present DQ-LoRe, which queries supplementary information from LLMs and uses it to re-query
  smaller-scale retrieval models. After acquiring re-ranked exemplars from the smaller retrieval
  model, DQ-LoRe provides the exemplars to the LLMs for inference, effectively addressing the
  challenge of exemplar selection.


• We employ straightforward and efficient dimensionality reduction techniques to extract crucial
  reasoning information from the high-dimensional representations of CoTs and questions. This
  makes it possible to differentiate between exemplars, in particular to separate exemplars
  characterized by word co-occurrence and spurious question-related associations from exemplars
  that exhibit genuine logical relevance.


• We demonstrate that DQ-LoRe achieves superior performance compared to existing methods and is
  particularly effective in the distribution shift setting, showcasing its robustness and
  adaptability across various scenarios.


2 RELATED WORK 


In-Context Learning. LLMs have demonstrated their in-context learning ability with the scaling of
model size and corpus size (Brown et al., 2020; Chowdhery et al., 2022; OpenAI, 2023). This ability
allows language models to learn tasks with only a few examples. Several studies have shown that 
LLMs can successfully perform various complex tasks using in-context learning, including natural 
language understanding and multi-step reasoning (Shin et al.] 
(2023). In addition to in-context examples, Wei et al. (2022b) have explored augmenting the
learning process with CoT. CoT involves providing a sequence of intermediate reasoning steps along
with the in-context example. Further studies show that the effectiveness of CoT can be enhanced
through various approaches. These approaches include breaking down complex questions (Zhou et al.,
2022), planning before inference (Wang et al., 2023a), and employing the CoT paradigm for multiple
rounds of voting and reasoning (Wang et al., 2022; Zheng et al., 2023). Notably, in the case of
multi-step reasoning, in-context learning with CoT has been found to outperform fine-tuning
conducted on the same large model with the full training set (Lewkowycz et al., 2022).


Exemplar Selection. The selection of good examples for in-context learning is a fundamental
question. However, previous studies have highlighted the challenges and instability of exemplar
selection (Zhao et al., 2021). Even slight changes in the order of samples within exemplars can
affect the model's output. The acquisition of exemplars is crucial for enhancing multi-step
reasoning capabilities (Liu et al., 2022). Existing efforts mainly focus on human-designed
approaches: the vanilla CoT (Wei et al., 2022b) utilizes eight manually written examples, while
PAL (Gao et al., 2023) repurposes these exemplars by converting them into programming language
statements. Complex-CoT (Fu et al., 2022) selects exemplars with the most complex CoTs from the
training set, resulting in improved performance on multi-step reasoning tasks.


1 We will release all the source code upon acceptance.


Figure 1: The overall pipeline of DQ-LoRe. It consists of three parts. Dual Query first queries
the LLM to obtain a CoT y, then queries the retriever to obtain the final exemplars via both the
question and the LLM-generated knowledge. LoRe leverages PCA to approximate the low-rank embedding
of the retrieved exemplars, enabling us to better distinguish them. The Retriever obtains exemplars
with similar CoTs through training with positive and negative sets constructed based on CoT
similarity produced by BM25 and the LLM.


Other efforts adopt a retrieval-based method that leverages encoders to encode exemplars and input
questions during training. This enables the selection of exemplars that are close to the input
questions. For example, Efficient Prompt Retrieval (EPR; Rubin et al., 2021) models the
interaction between input questions and in-context examples and optimizes it through a contrastive
learning objective to obtain preferred exemplars, while Zhang et al. (2022b) cluster training
instances into k categories and select the k samples closest to the cluster centers. Compositional
Exemplars for In-context Learning (CEIL; Ye et al., 2023) utilizes Determinantal Point Processes
to model the interplay between the provided input and in-context examples. This modeling is
further enhanced through a carefully designed contrastive learning objective, with the goal of
extracting preferences from language models. Whereas existing retrievers are trained separately
for different tasks, Li et al. (2023) propose a unified retriever to retrieve exemplars for a wide
range of tasks. Unlike these methods, we propose to model the relationship between reasoning
processes, enabling better exemplar selection.


3 METHODOLOGY 


3.1 REASONING WITH DUAL QUERIES 


As shown in Figure 1, we first query the LLM to generate a CoT. We start with initial n-shot
exemplars, denoted as $e_{init}$. These n-shot exemplars can be retrieved using BM25 based on their
semantic similarity to the input question, or with previous methods such as those proposed by
Rubin et al. (2021), Ye et al. (2023), and Zhang et al. (2022b). They can also be manually
designed exemplars, including CoT and other templates.


In our experiments, we employ the Complex-CoT method to obtain the initial n-shot exemplars. This
choice is motivated by our observation that using Complex-CoT prompts for querying LLMs can result
in CoTs that are richer in inference information and more complex. These initial n exemplars and
the question $x_i$ are used to query the LLM and obtain the CoT $y_i$.


With the question $x_i$ and the generated CoT $y_i$, we use the encoder $s_e$ (trained as described
in Section 3.2) to obtain the embeddings of the test sample $t_i$ and of all exemplars in the
training set.
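

The control flow of the two queries can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `build_prompt`, `dual_query`, the prompt template, and the `llm` and `retrieve` callables are hypothetical names introduced here, and the retrieval step (including any re-ranking) is assumed to be hidden behind `retrieve`.

```python
from typing import Callable, List, Sequence


def build_prompt(exemplars: Sequence[str], question: str) -> str:
    """Join worked exemplars and the new question into one few-shot prompt."""
    shots = "\n\n".join(exemplars)
    return f"{shots}\n\nQuestion: {question}\nA: Let's think step by step."


def dual_query(
    question: str,
    init_exemplars: Sequence[str],
    llm: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    n: int = 8,
) -> str:
    """Run the two queries, then the final inference call.

    `llm` maps a prompt to generated text and `retrieve` maps a query string
    and a count to exemplar strings; both are placeholders (assumptions).
    """
    # Query 1: prompt the LLM with the initial exemplars to get a CoT.
    cot = llm(build_prompt(init_exemplars, question))
    # Query 2: retrieve exemplars using the question concatenated with that CoT.
    exemplars = retrieve(f"{question} {cot}", n)
    # Final inference: prompt the LLM again with the retrieved exemplars.
    return llm(build_prompt(exemplars, question))
```

Passing the LLM and the retriever as plain callables keeps the sketch independent of any particular API client or encoder.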


3.2 COT-AWARE RETRIEVER MODEL TRAINING 


We train a retriever that can measure the similarity between a CoT and an exemplar. Similar to
previous studies, we apply contrastive learning to train a sentence encoder $s_e$ as our retriever.
Specifically, we use data from the training set, where each sample $d_i = (x_i, y_i)$ contains a
question $x_i$ and its corresponding CoT $y_i$.


Given a training sample $d_i$, we construct its corresponding positive and negative sets. We first
employ BM25 (Robertson et al., 2009) to retrieve the top-k similar training samples as candidate
exemplars from the entire training set, denoted as $D' = \{d'_1, d'_2, \ldots, d'_k\}$.


After obtaining these k samples, we re-rank them by how close each exemplar $d'_j$ is to the CoT.
We apply a pretrained language model (LM) such as text-davinci-003 to calculate the probability:

$$\mathrm{score}(d'_j) = P_{LM}(y_i \mid d'_j, x_i) \tag{1}$$

where $P_{LM}(y_i \mid d'_j, x_i)$ is the probability of the LM generating the CoT $y_i$ given
$d'_j$ and the input context $x_i$. A higher $\mathrm{score}(d'_j)$ indicates a higher probability
that $d'_j$ entails the CoT $y_i$ and shares similar reasoning logic. We re-rank the exemplars in
$D'$ based on their scores. We select the top t samples as positive examples, denoted as $pos_i$,
and the last m samples as challenging negative examples, denoted as $neg_i$.
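

The construction of the positive and negative sets can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: `build_pos_neg` and `score_fn` are hypothetical names, and `score_fn` stands in for the LM score of Equation 1 (any monotone proxy such as a log-probability yields the same ranking).

```python
from typing import Callable, List, Sequence, Tuple


def build_pos_neg(
    candidates: Sequence[str],
    score_fn: Callable[[str], float],
    t: int,
    m: int,
) -> Tuple[List[str], List[str]]:
    """Split BM25 candidates into positives (top t) and negatives (last m).

    `score_fn` is a placeholder for score(d'_j) in Equation 1; how it calls
    the LM is not shown here.
    """
    if t < 0 or m < 0 or t + m > len(candidates):
        raise ValueError("need t, m >= 0 and t + m <= number of candidates")
    # Re-rank candidates from highest to lowest LM score.
    ranked = sorted(candidates, key=score_fn, reverse=True)
    positives = ranked[:t]
    # Slice from an explicit index so that m == 0 yields an empty list.
    negatives = ranked[len(ranked) - m:]
    return positives, negatives
```

Because the negatives come from the bottom of a BM25 shortlist, they still overlap lexically with the anchor question, which is what makes them challenging.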


During training, we construct each training batch by sampling anchors $d_i$. For each $d_i$, we
randomly select one positive example $e_i^+$ and one negative example $e_i^-$ from $pos_i$ and
$neg_i$, respectively. We treat the positive and negative examples of the other anchors within the
same batch as negatives for $d_i$. Thus, the contrastive loss with b anchors has the following
form:

$$\mathrm{Loss}(x_i, y_i, e_i^+, e_1^+, e_1^-, \ldots, e_i^-, \ldots, e_b^-) = -\log \frac{e^{\mathrm{sim}(x_i, y_i, e_i^+)}}{\sum_{j=1}^{b} e^{\mathrm{sim}(x_i, y_i, e_j^+)} + \sum_{j=1}^{b} e^{\mathrm{sim}(x_i, y_i, e_j^-)}} \tag{2}$$

where sim is the similarity between the anchor sample $d_i = (x_i, y_i)$ and an exemplar $e_j$,
computed as the inner product of their sequence embeddings:

$$\mathrm{sim}(x_i, y_i, e_j) = \langle s_e(x_i + y_i), s_e(e_j) \rangle \tag{3}$$


Here $s_e$ is the BERT-initialized sentence encoder trained with the loss above. After training, we
use $s_e$ to encode the concatenation of the question and the CoT as the sentence representation.
We use the trained $s_e$ to retrieve exemplars, computing similarity with vector inner products.
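

To make the in-batch loss concrete, the sketch below evaluates it on precomputed embeddings. It is a minimal illustration under assumptions: `contrastive_losses` and `dot` are hypothetical helpers, embeddings are plain lists of floats rather than encoder outputs, and the max-subtraction is a standard numerical-stability trick that the text does not mention.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product, used as the similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_losses(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
) -> List[float]:
    """Per-anchor loss for a batch of b anchors.

    anchors[i] plays the role of s_e(x_i + y_i); positives[i] and
    negatives[i] play the roles of s_e(e_i^+) and s_e(e_i^-).
    """
    losses = []
    for i, anchor in enumerate(anchors):
        # Denominator terms: every positive and negative in the batch.
        scores = [dot(anchor, e) for e in positives]
        scores += [dot(anchor, e) for e in negatives]
        # Log-sum-exp with the maximum subtracted to avoid overflow.
        top = max(scores)
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        # -log(exp(sim_pos) / denominator) = log(denominator) - sim_pos
        losses.append(log_denominator - dot(anchor, positives[i]))
    return losses
```

With one anchor whose positive is aligned and whose negative is orthogonal, the loss is small; swapping the two roles makes it larger.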


3.3 LORE: LOW RANK APPROXIMATION RE-RANKING 


Based on the similarity computed with Equation 3, we select the top-M exemplars $E_M$ (where
M > n) for re-ranking. The M exemplars in $E_M$ are retrieved based on semantic similarity and
often exhibit highly similar CoTs. This results in a mixture of exemplars that exhibit a spurious
correlation with the current question and exemplars that are genuinely logically relevant within
the CoT, making it difficult to distinguish between them. To address this issue, we employ
Principal Component Analysis (PCA) to reduce the embeddings of the M exemplars and the target
sample $t_i$ to a final dimension $\epsilon$. We then recalculate the similarity between each
exemplar $e_j$ and $t_i$ using the reduced embeddings.


Finally, we obtain the top-n exemplars, denoted as $E_n$, based on the similarity scores after
re-ranking. We concatenate $E_n$ with $x_i$ and input the result into the LLM to obtain the final
CoT for ICL. With these CoT exemplars, we prompt the LLM and parse its output as in previous ICL
methods to obtain the final answer.
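

A rough sketch of the re-ranking step follows, assuming NumPy is available. Several choices here are assumptions rather than details given in the text: `lore_rerank` is a hypothetical name, PCA is fit on the M exemplar embeddings stacked with the query embedding, the reduced-space similarity is an inner product (mirroring Equation 3), and the number of components is capped by what the stacked matrix can support.

```python
import numpy as np


def lore_rerank(query_emb: np.ndarray, exemplar_embs: np.ndarray, n: int, dim: int) -> list:
    """Return indices of the top-n exemplars after PCA-based re-ranking.

    query_emb has shape (d,), exemplar_embs has shape (M, d).
    """
    # Stack exemplars and query so both are projected with the same PCA.
    stacked = np.vstack([exemplar_embs, query_emb[None, :]])
    centered = stacked - stacked.mean(axis=0, keepdims=True)
    # PCA via SVD: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(dim, vt.shape[0])  # cannot keep more components than exist
    reduced = centered @ vt[:k].T
    # Similarity between each reduced exemplar and the reduced query.
    scores = reduced[:-1] @ reduced[-1]
    return np.argsort(-scores)[:n].tolist()
```

Fitting PCA on only M + 1 vectors yields at most M informative components, so a target dimension larger than that would need PCA to be fit on a larger set of embeddings; this section does not specify which set is used.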


4 EXPERIMENT 


In our experiments, we evaluate the proposed DQ-LoRe in both the independent and identically
distributed (i.i.d.) setting and the distribution shift setting. In the i.i.d. setting, we use the
same dataset for training the retriever and for exemplar selection during testing. In the
distribution shift setting, we train the retriever on one dataset and then retrieve exemplars from
another dataset during testing. We present the experiment details and results in this section. An
introduction to our baselines is provided in Appendix B.


4.1 DATASET 


We conduct experiments on three datasets: AQUA (Ling et al., 2017), GSM8K (Cobbe et al., 2021),
and SVAMP (Patel et al., 2021). Among these datasets, AQUA and GSM8K have CoT annotations.


Since AQUA contains over ten thousand training examples, constructing the positive and negative
sets for each training example has a high computational cost. Thus, we randomly sample one
thousand examples from AQUA for training our retriever. In addition to our primary focus on
mathematical reasoning datasets, we conducted experiments on commonsense datasets such as
StrategyQA and QASC (Khot et al., 2020). Further details can be found in the Appendix.


The SVAMP dataset introduces designed perturbations to evaluate whether LLMs have learned spurious
correlations in mathematical word problems, covering question sensitivity, structural invariance,
and reasoning ability. Since SVAMP does not have ground-truth CoT annotations, we generate CoTs
using GPT-3.5 with Complex-CoT examples. For each training example in these two datasets, we
perform eight independent samplings at a temperature of 0.7 and select one CoT at random from the
generated results. In the end, we acquired 664 training samples with CoTs from SVAMP's training
data.


When training the retriever, we use the Adam optimizer (Kingma & Ba, 2014) with batch size 64 and
learning rate 1e-5, and run training for 120 epochs on 8 NVIDIA 3090 GPUs. For each task, we
search (1) the LoRe parameter M in {16, 32, 64} and (2) the LoRe final dimension $\epsilon$ in
{128, 256, 512}.


4.2 MAIN RESULTS 


Table 1 shows the model's performance in the i.i.d. setting. Our method achieves the best results
on the GSM8K and AQUA datasets. On the SVAMP dataset, if the retriever is trained with the
generated CoT, our model does not outperform EPR. This is because EPR tends to capture and exploit
word co-occurrence patterns. EPR can therefore retrieve exemplars that are slight perturbations of
the test sample, so it performs remarkably well on SVAMP but slightly worse on the other datasets
(Patel et al., 2021). Our case study in Section 4.7 shows the same phenomenon.


To mitigate the impact of these spurious correlations while training smaller models on SVAMP, we
adopt an alternative training strategy. We train the retriever on the GSM8K dataset and conduct
retrieval and testing on the SVAMP test set. This approach proved effective in neutralizing the
influence of spurious correlations among questions on smaller models, ultimately leading to 90%
accuracy on SVAMP with GPT-3.5-Turbo. However, this training approach has a limited effect on EPR.
We hypothesize that EPR predominantly relies on word co-occurrence patterns among questions and
does not consider the similarities between CoTs. We present more detailed case studies in the
Appendix.


In Table 2, we show the ICL results for GPT-4 on the GSM8K dataset. Our model surpasses the
previous state-of-the-art retrieval-based method CEIL by a margin of 1.7% accuracy (94.2% vs.
92.5%).


Table 1: The accuracy (%) of different models under the i.i.d. setting. Complex-CoT selects the
most complex CoT from either the annotation or GPT-3.5-Turbo output. All methods select 8-shot
exemplars except for CoT, which uses 4-shot manually annotated exemplars. SVAMP* represents the
results obtained by training the retriever on the GSM8K dataset and then conducting testing by
retrieving exemplars on SVAMP.


Engine            Model         GSM8K   AQUA   SVAMP   SVAMP*
Text-davinci-003  CoT           55.1    35.8   77.3    -
                  Complex-CoT   66.8    46.5   73.0    78.3
                  Auto-CoT      60.7    42.6   80.0    81.3
                  EPR           64.6    45.0   84.6    84.6
                  CEIL          63.7    47.2   75.3    81.3
                  DQ-LoRe       69.1    48.0   83.0    85.0
GPT-3.5-Turbo     CoT           77.0    51.9   82.0    -
                  Complex-CoT   79.3    57.0   84.0    79.3
                  Auto-CoT      78.4    50.4   86.0    87.3
                  EPR           77.3    57.8   89.0    88.0
                  CEIL          79.4    54.7   83.7    87.3
                  DQ-LoRe       80.7    59.8   85.3    90.0


Table 2: The accuracy (%) of different ICL methods with GPT-4 on the GSM8K dataset under the
i.i.d. setting.


Engine   CoT    Complex-CoT   EPR    CEIL   DQ-LoRe
GPT-4    93.0   93.4          91.3   92.5   94.2


4.3 TEST RESULTS FOR DISTRIBUTION SHIFT 


We evaluate the robustness of different methods when the exemplar set and the test problems come
from different distributions. Specifically, we introduce the MultiArith dataset to create a
rigorous evaluation scenario with distribution shift. We merged the training and test samples of
MultiArith into a single test set comprising 600 questions. Our goal was to inspect how well an
approach adapts to a distinct distribution while relying solely on GSM8K exemplars. This setting
reduces the possibility of high performance brought about by spurious correlations such as
co-occurrence patterns among question prompts. The experimental results therefore reflect the
robustness and generalization ability of each evaluated method.


The results are shown in Table 3. Our approach exhibits remarkable robustness, particularly on the
SVAMP dataset. Our method, which is trained on and retrieves from the GSM8K dataset, successfully
reduces the negative effects of word co-occurrence. This underscores the efficacy of our approach
in addressing the distribution shift issue and spurious correlation in a variety of contexts.


Table 3: The accuracy (%) under the domain shift setting. Each method is trained on GSM8K and
tested on the corresponding datasets.


Engine            Model         SVAMP   MultiArith   SingleEq
Text-davinci-003  CoT           77.3    92.3         93.8
                  Complex-CoT   78.3    91.5         93.5
                  EPR           75.3    92.3         92.5
                  CEIL          76.3    93.5         92.3
                  DQ-LoRe       79.6    94.5         93.5
GPT-3.5-Turbo     CoT           82.0    98.0         95.6
                  Complex-CoT   79.3    97.8         96.0
                  EPR           78.5    98.0         96.3
                  CEIL          81.2    97.3         94.8
                  DQ-LoRe       84.0    98.5         96.5


Moreover, we observe intriguing nuances when examining performance on two relatively simple
datasets, SingleEq and MultiArith. Although careful selection of exemplars yields incremental
gains, the simplest configuration, a fixed 8-shot manually designed CoT, also achieves competitive
performance. In some instances, this straightforward CoT configuration outperforms the other
methods, particularly on the SingleEq dataset with the text-davinci-003 engine. These findings
emphasize the versatility and potential of our approach across a spectrum of datasets and
retrieval scenarios.


4.4 ABLATION STUDY 


In this section, we provide a detailed analysis of the impact of each component on the
experimental results. The following results are obtained under the i.i.d. setting for GSM8K, using
text-davinci-003 as the inference LLM.


Table 4: Ablation study of individual components.

Engine            DQ-LoRe   DQ-LoRe w/o LoRe   EPR    EPR with LoRe
Text-davinci-003  69.1      66.5               64.6   63.7


The results of the ablation study on the low-rank approximation re-ranking (LoRe) are shown in
Table 4. LoRe has a significant impact on retriever models trained with CoT, raising accuracy from
66.5% to 69.1%. In contrast, it has an adverse impact on retrievers that do not make use of CoT
information, such as EPR, whose accuracy drops from 64.6% to 63.7%. This observation highlights
that the efficacy of PCA dimensionality reduction for re-ranking is contingent upon the presence
of a sufficient volume of information within the CoT. In other words, it is effective only when
the embeddings of exemplars possess a certain degree of redundancy.


By comparing the results of DQ-LoRe w/o LoRe and EPR, we observe a 1.9% improvement in the final
experimental outcomes when employing a retriever trained with CoT enhancement. This comparison
also underscores the effectiveness of leveraging question-CoT pair information, which surpasses
the use of question information in isolation.


4.5 THE INFLUENCE OF INITIAL EXEMPLARS 


Table 5: The final accuracy (%) with different initial n-shot exemplars in the i.i.d. setting on
the SVAMP dataset.


Engine            Initial Exemplars   SVAMP
Text-davinci-003  Random              78.3
                  EPR                 83.6
                  CEIL                81.3
                  Socratic-CoT        83.7
                  Complex-CoT         83.0


In this section, we analyze the impact of various methods for obtaining initial exemplars on the
final results, which we present in Table 5. The method used to obtain initial exemplars has a
significant impact on the final results. In our experiment, "Random" refers to randomly selecting
8 exemplars from the training set at each inference. "EPR" and "CEIL" denote the initial 8-shot
exemplars acquired through training on SVAMP. "Socratic-CoT" denotes using Socratic-formatted
exemplars from Complex-CoT to annotate the SVAMP training set. Of the 700 training data points, we
successfully annotated 624 using GPT-3.5-Turbo. We then conducted training and retrieval using
DQ-LoRe with these initial 8-shot exemplars on the resulting Socratic-formatted exemplars. The
Socratic-formatted data itself originates from one of the annotation formats present in the GSM8K
dataset. This experiment reveals the impact of different initial prompt formats on the final
results. We find that Socratic-CoT outperforms Complex-CoT under the i.i.d. setting on SVAMP
(83.7% vs. 83.0%), indicating that the initial prompt format affects the model's ultimate
performance. Moreover, the method of selecting initial exemplars also affects the model's final
results. Approaches such as EPR, CEIL, and DQ-LoRe, which carefully select initial exemplars,
perform significantly better than a random selection of 8 initial exemplars.


Figure 2: t-SNE visualization of embeddings before and after LoRe.


4.6 LORE VISUALIZATION 


In this section, we analyze the impact of LoRe, which employs PCA. We first select the eight
exemplars located farthest from the query in the high-dimensional space of the trained encoder.
These exemplars represent the worst case for retrieval. Under the i.i.d. setting on the GSM8K
dataset with the text-davinci-003 model, these worst exemplars yield an accuracy of 48.1%. This
outcome supports the notion that the encoder we have trained can effectively discern between
"good" and "bad" exemplars. Building on this, we select M exemplars categorized as "good" and
"bad" based on the encoder's discernment and visualize their embeddings before and after LoRe
dimensionality reduction using t-SNE (Van der Maaten & Hinton, 2008).

On the SVAMP dataset, the GSM8K retriever achieves better performance than the SVAMP retriever.
Hence, we further draw the corresponding t-SNE visualization, shown in Figure 2. We initially
retrieved M=64 embeddings using the retriever in the second query. These embeddings were then
subjected to dimensionality reduction. Compared with the "good" and "bad" embeddings of the SVAMP
retriever, the "good" and "bad" embeddings of the GSM8K retriever are more clearly separated,
suggesting that enlarging the difference between "good" and "bad" embeddings can further improve
performance. Under the i.i.d. setting on the SVAMP dataset, before the LoRe PCA process, the
distributions of "good" and "bad" embeddings are mixed. After the LoRe PCA process, the "good"
embeddings move outward with a pronounced trend, whereas the "bad" embeddings exhibit only a
slight trend in the same direction, widening the gap between them. This divergence contributes to
the performance improvement. Thus, LoRe's PCA process effectively amplifies the distinction
between "good" and "bad" embeddings, further enhancing overall performance.


4.7 CASE STUDY 


In this section, we present two typical exemplars retrieved by EPR and DQ-LoRe under the i.i.d.
setting on the SVAMP dataset. Table 6 shows that although EPR achieves the highest accuracy on
SVAMP, it relies on patterns of word co-occurrence and happens to find the right prompt as a
shortcut; that is, EPR has learned spurious correlations. The model tends to group together all
exemplars featuring word co-occurrence, which is prevalent in the SVAMP dataset because of its
numerous analogous questions that differ mainly in numerical values. Consequently, EPR has no
trouble retrieving exemplars marked by significant word co-occurrence and similar problem-solving
approaches.


Table 6: Case Study on SVAMP


Problem

Question: Mary is baking a cake. The recipe calls for 6 cups of flour 8 cups of sugar and 7 cups of
salt. She already put in 5 cups of flour. How many more cups of sugar than cups of salt does she
need to add now?

EPR Prompt 

Question: Mary is baking a cake. The recipe calls for 12 cups of flour 14 cups of sugar and 7 cups
of salt. She already put in 2 cups of flour. How many more cups of flour than cups of salt does
she need to add now?

A: Let’s think step by step. To find out how many more cups of flour than cups of salt Mary needs
to add now, we first need to calculate the total amount of flour and salt needed for the recipe.
The recipe calls for 12 cups of flour and Mary has already added 2 cups, so she still needs
12 - 2 = 10 cups of flour. The recipe also calls for 7 cups of salt and Mary has not added any
yet, so she needs 7 cups of salt. Therefore, Mary needs to add 10 - 7 = 3 more cups of flour than
cups of salt. The answer is 3.

Question: Mary is baking a cake. The recipe calls for 9 cups of sugar 7 cups of flour and 4 cups of
salt. She already put in 2 cups of flour. How many more cups of flour than cups of salt does she
need to add now?

A: Let’s think step by step. Mary needs to add 7 - 2 = 5 more cups of flour. She needs to add
5 - 4 = 1 more cup of flour than cups of salt. Answer: 1. The answer is 1.


DQ-LoRe (ours) 

Question: Mary is baking a cake. The recipe calls for 5 cups of sugar and 14 cups of flour. She
already put in 11 cups of flour. How many more cups of sugar than cups of flour does she need to
add now?

A: Let’s think step by step. The recipe requires 5 cups of sugar and 14 cups of flour. She already
added 11 cups of flour, so she still needs to add 14 - 11 = 3 cups of flour. For the sugar, she
needs 5 cups and she hasn’t added any yet, so she needs to add all 5 cups. Therefore, she needs to
add 5 - 3 = 2 more cups of sugar than cups of flour. The answer is 2.

Question: Rachel had to complete 5 pages of math homework and yet more pages of reading homework.
If she had to complete a total of 7 pages of homework. How many pages of reading homework did she
have to complete?

A: Let’s think step by step. Rachel had to complete 5 pages of math homework and yet more pages
of reading homework. If she had to complete a total of 7 pages of homework, then the number of
reading homework pages can be found by subtracting the number of math homework pages from the
total number of homework pages. So, Rachel had to complete 7 - 5 = 2 pages of reading homework.
Therefore, Rachel had to complete 2 pages of reading homework. The answer is 2.


However, our model not only identifies exemplars with word co-occurrence and similar
problem-solving approaches but, more importantly, can also find exemplars that lack word
co-occurrence yet share a common problem-solving approach. This implies that our model can
discover deeper logical associations and connections between exemplars. The complete results for
EPR, CEIL, and DQ-LoRe on the SVAMP dataset are presented in Appendix G.


5 CONCLUSION 


In this study, we introduce DQ-LoRe, a dual-query framework that enhances in-context learning for
multi-step reasoning tasks. The framework considers the chain-of-thoughts of both input questions
and exemplars, applies PCA to filter redundant information from CoT-containing exemplars, and then
re-ranks them to obtain the final exemplars. This approach addresses the difficult problem of
exemplar selection. The dimensionality reduction step extracts the essential reasoning information
from CoTs, which makes the distinctions among exemplars easier to discern. Our experimental results
demonstrate that DQ-LoRe outperforms existing methods, particularly in scenarios involving
distribution shifts, which indicates its robustness across a range of settings.
APPENDIX 


A OTHER RELATED WORKS 


Low Rank Approximation. Principal component analysis (PCA; Wold et al., 1987) is a dimensionality
reduction method that is often used to reduce the dimensionality of large data sets by
transforming a large set of variables into a smaller one that still contains most of the information
in the large set. PCA is widely used in various natural language processing tasks (Li et al., 2020;
Su et al., 2021). In Su et al. (2021), it was observed that retaining only the first
256 dimensions of the 768-dimensional embeddings from BERT-base (Devlin et al., 2018) led
to an improvement in performance and, due to the dimensionality reduction, significantly accelerated
vector retrieval. This finding suggests that unsupervised training of sentence embeddings
produces "universal" representations that contain many redundant features for specific domain
applications. Removing these redundant features often leads to a dual benefit of enhanced speed and
effectiveness (Wang et al., 2023b; Pan et al., 2022).
Inspired by Li et al. (2020) and Su et al. (2021), after obtaining a set of M exemplars,
we perform dimensionality reduction on the embeddings of these M exemplars. We employ Principal
Component Analysis (PCA; Wold et al., 1987), as we believe it can effectively extract key reasoning
information from the CoT to distinguish different exemplars. The observation of Su et al. (2021)
described above also inspires us in the context of retrieval-based methods for in-context learning.
The sentence embeddings of the retrieved exemplars often contain a substantial amount of similar
information. We can employ PCA for dimensionality reduction to filter out this redundant information
and thereby distinguish these exemplars. There are other methods for vector dimensionality
reduction, such as traditional Singular Value Decomposition (SVD) or low-rank approximation
techniques (1999). Parameter-based training methods for vector reduction have also been proposed
(2021). Another approach to enhance the separability of these exemplars is to use dimensionality
expansion techniques, such as the Maximal Margin Hyperplane (MMH) method (1998). In contrast to the
methods mentioned above, we opted for PCA for low-rank approximation because it requires no
training and has lower computational overhead.
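
The re-ranking step described above can be illustrated with a minimal sketch. It assumes that the M exemplar embeddings and the query embedding (question plus CoT) are already available as NumPy arrays, that the principal directions are fitted on the M exemplars only, and that cosine similarity is used for scoring in the reduced space. The function name `pca_rerank` and its parameters are hypothetical and are not taken from the released implementation.

```python
import numpy as np


def pca_rerank(query_emb, exemplar_embs, n_components, top_n):
    """Re-rank M retrieved exemplars in a PCA-reduced space (illustrative sketch)."""
    # Center on the mean of the M exemplars so PCA captures how they differ.
    mean = exemplar_embs.mean(axis=0, keepdims=True)
    centered = exemplar_embs - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:n_components].T  # shape (d, n_components)
    reduced_exemplars = centered @ basis
    reduced_query = (query_emb - mean.ravel()) @ basis
    # Cosine similarity in the reduced space (an assumed scoring function).
    eps = 1e-12
    sims = (reduced_exemplars @ reduced_query) / (
        np.linalg.norm(reduced_exemplars, axis=1) * np.linalg.norm(reduced_query) + eps
    )
    # Keep the top_n exemplars with the highest similarity.
    order = np.argsort(-sims)[:top_n]
    return order, sims[order]
```

With M = 16 retrieved exemplars and `top_n=8`, the function returns the indices of the 8 exemplars that are closest to the query after the redundant directions have been discarded.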


Spurious Correlation Phenomenon in Math Word Problems. In this paper, we primarily investigate
how to enhance multi-step reasoning tasks, such as math word problems, through improved
in-context learning methods. Prior research has reported that this task exhibits substantial
spurious correlations, even in small-scale language models (Patel et al., 2021). Models struggle
to differentiate equivalent solutions, and we observe the corresponding phenomenon in large models
as well, where semantic similarity does not necessarily imply logical equivalence in the context
of CoT. Previous works have employed contrastive learning to mitigate this bias by aligning the
model’s intermediate representations when producing equivalent solutions. Inspired by Karpukhin
et al. (2020), we also employ contrastive learning to train the association between our CoTs and
exemplars, enhancing the robustness of the encoder.
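
As an illustration of the kind of contrastive objective referred to here, the sketch below computes the negative log-likelihood of a positive exemplar against a set of negatives using dot-product scores. The exact objective used for training is not specified in this appendix, so this form, the function name `contrastive_loss`, and its arguments are assumptions made for illustration only.

```python
import numpy as np


def contrastive_loss(query_emb, positive_emb, negative_embs):
    """Negative log-likelihood of the positive exemplar (illustrative sketch)."""
    # Dot-product similarity between the query and each candidate.
    pos_score = query_emb @ positive_emb
    neg_scores = negative_embs @ query_emb
    scores = np.concatenate([[pos_score], neg_scores])
    # Log-sum-exp with max subtraction for numerical stability.
    m = scores.max()
    log_z = m + np.log(np.exp(scores - m).sum())
    # -log softmax(scores)[0]: small when the positive outscores the negatives.
    return float(log_z - pos_score)
```

The loss decreases as the encoder places the query (question plus CoT) closer to the positive exemplar than to the negatives.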
B INTRODUCTION OF OUR BASELINES 


We compare DQ-LoRe to an extensive set of baselines and state-of-the-art models. Details are
provided as follows:


CoT (Wei et al., 2022b) enhances the learning process by incorporating chain-of-thoughts, that is,
by presenting a series of intermediate reasoning steps together with the relevant in-context
example. Prompting LLMs to engage in CoT can lead to more accurate outcomes, as articulating each
step can encourage the model to consider aspects of the problem it might otherwise have overlooked.
CoT can be applied across various domains, including mathematical, logical, and symbolic reasoning,
and any area where complex reasoning is required.


Complex-CoT further enhances CoT on complex reasoning tasks by selecting 
exemplars with the most intricate CoTs from the training set. Specifically, multiple reasoning chains 
were sampled from the model, and then the majority of generated answers from complex reasoning 
chains (over simple chains) were chosen. When used to prompt LLMs like GPT-3 and Codex, 
Complex-CoT substantially improves multi-step reasoning accuracy. 
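
The voting step over complex chains can be sketched as follows. The sketch assumes that complexity is measured by the number of non-empty lines in a sampled chain, which is only one possible proxy, and the names `count_steps` and `complexity_vote` are hypothetical.

```python
from collections import Counter


def count_steps(chain):
    # Assumption: one reasoning step per non-empty line of the chain.
    return sum(1 for line in chain.splitlines() if line.strip())


def complexity_vote(samples, k):
    """Majority answer among the k most complex sampled chains.

    samples: list of (chain_text, answer) pairs sampled from the model.
    """
    ranked = sorted(samples, key=lambda s: count_steps(s[0]), reverse=True)
    votes = Counter(answer for _, answer in ranked[:k])
    return votes.most_common(1)[0][0]
```

Restricting the vote to the k most complex chains means that short chains do not contribute to the final answer, even when they are in the majority.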


Auto-CoT (Zhang et al., 2022b), on the other hand, leverages diversity within the training set to
identify CoTs with maximal distinctions. The rationale behind this approach is to extract a richer
spectrum of information. Empirical findings from their experiments substantiate this perspective.


EPR (Rubin et al., 2021) aims to retrieve prompts for in-context learning using annotated data and
a language model. Given an input-output pair, EPR estimates the probability of the output given
the input and a candidate training example as the prompt, and labels training examples as positive
or negative based on this probability. Next, an efficient dense retriever is trained from this data
and used to retrieve training examples as prompts at test time. This process better models the
interaction between input questions and in-context examples; optimizing this interaction through a
contrastive learning objective helps to identify and prioritize preferred exemplars.
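
The labeling step can be sketched as below. Here `score_fn` is a hypothetical stand-in for the language model's estimate of the probability of the output given the input and a candidate prompt, and taking the top-k and bottom-k candidates as positives and negatives is an assumption about how the probability is thresholded.

```python
def label_candidates(candidates, x, y, score_fn, k):
    """Split candidate prompts into positives and negatives (illustrative sketch).

    score_fn(candidate, x, y) is assumed to return the language model's
    (log-)probability of output y given input x with the candidate as prompt.
    """
    if 2 * k > len(candidates):
        raise ValueError("need at least 2 * k candidates")
    # Highest-scoring candidates first.
    scored = sorted(candidates, key=lambda c: score_fn(c, x, y), reverse=True)
    positives = scored[:k]   # candidates that make y most likely
    negatives = scored[-k:]  # candidates that make y least likely
    return positives, negatives
```

The resulting positive and negative sets are the supervision from which the dense retriever is trained.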


CEIL formulates in-context example selection as a subset selection problem. 
Specifically, CEIL employs Determinantal Point Processes to capture the interaction between the 
provided input and in-context examples. It is refined through a meticulously crafted contrastive 
learning objective, aiming to discern preferences from language models. 
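
To make the subset-selection view concrete, the sketch below performs greedy selection under a Determinantal Point Process, where a subset is scored by the log-determinant of the corresponding kernel submatrix. It is a generic greedy approximation and not CEIL's trained, input-conditioned kernel; the function name `greedy_dpp` is hypothetical.

```python
import numpy as np


def greedy_dpp(kernel, k):
    """Greedily pick up to k items that maximize log det of the kernel submatrix."""
    selected = []
    remaining = list(range(kernel.shape[0]))
    for _ in range(k):
        best, best_val = None, -np.inf
        for i in remaining:
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            # A non-positive determinant means the candidate adds no new direction.
            val = logdet if sign > 0 else -np.inf
            if val > best_val:
                best, best_val = i, val
        if best is None:
            break  # every remaining item is redundant with the current selection
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the determinant shrinks when two selected items are similar, the selection trades off the quality of individual exemplars against the diversity of the subset.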


C EXPERIMENTS ON COMMONSENSE DATASETS 


In this section, we conduct experiments on the commonsense datasets StrategyQA and QASC
using gpt-3.5-turbo-16k. The detailed experimental results are presented in the table below:


Table 7: Experiments on Commonsense Datasets. 


Model                        StrategyQA    QASC
CoT                          73.8          81.8
Complex-CoT                  74.5          75.8
Auto-CoT                     71.2          74.1
EPR                          73.4          80.2
CEIL                         73.4          80.8
DQ-LoRe (PCA)                74.6          81.2
DQ-LoRe (Gaussian kernel)    75.4          82.7


For the question-answering dataset, we introduce LoRe with Gaussian kernel, where the PCA step is 
replaced with a Gaussian kernel to compute the similarity between the Exemplar’s embedding and 
the current inference question + CoT (obtained from the first query) embedding. 


The motivation for adopting the kernel method here is to preserve as much information as possible in
the embeddings on commonsense datasets. Although this is not a dimensionality reduction approach,
the fundamental idea remains unchanged: first, the embeddings of M similar exemplars are retrieved
and mapped into another vector space. After this mapping, the similarity of these embeddings to the
current problem’s question+CoT embedding is recalculated, followed by re-ranking to yield the final
N exemplars (M > N).
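
A minimal sketch of this kernel-based variant is given below. It assumes the standard Gaussian (RBF) kernel with a bandwidth `sigma`, whose value is not reported here; the function name `gaussian_kernel_rerank` and its parameters are hypothetical.

```python
import numpy as np


def gaussian_kernel_rerank(query_emb, exemplar_embs, top_n, sigma=1.0):
    """Re-rank M exemplars by Gaussian-kernel similarity to the query (sketch)."""
    # Squared Euclidean distance from each exemplar to the question+CoT embedding.
    sq_dists = np.sum((exemplar_embs - query_emb) ** 2, axis=1)
    # Gaussian kernel: k(x, q) = exp(-||x - q||^2 / (2 * sigma^2)).
    sims = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Keep the top_n exemplars with the highest kernel similarity.
    order = np.argsort(-sims)[:top_n]
    return order, sims[order]
```

Unlike the PCA variant, no coordinates of the embedding are discarded before the similarity is computed, which matches the motivation of preserving information in the embeddings.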


On commonsense datasets, our approach remains state-of-the-art. An interesting finding is that
using a kernel trick to map the exemplar embeddings into a higher-dimensional space for re-ranking,
instead of PCA for dimensionality reduction, effectively enhances the model’s performance: on QASC,
accuracy improves by 1.5 points over PCA (82.7 vs. 81.2). Commonsense datasets differ significantly
from math problems. Specifically, the background knowledge required for math word problems is
presented within the question itself, while the factual information for commonsense datasets needs
to be queried from Large Language Models (LLMs). Moreover, this task is sensitive to textual
details. Filtering information in the embeddings is therefore not a good choice for commonsense
datasets, as dimensionality reduction could remove vital entity information.
Hence, we adopt a kernel method to transform the embeddings, so that the information
within the embeddings is preserved while the exemplars still become separable.


D EXPERIMENTS WITH DIFFERENT ENCODERS AND ABLATION OF LORE 


In this section, we investigate the effects of different base encoders on the experimental
outcomes with gpt-3.5-turbo-16k. Our findings indicate that, in terms of overall performance, the
BERT-base-uncased model outperforms the RoBERTa-base model, except for EPR, which improves when
employing RoBERTa-base. Furthermore, we assess the influence of LoRe on various models and note
that the adoption of LoRe leads to a decline in performance for both EPR and CEIL with
BERT-base-uncased (77.3 to 77.0 and 79.4 to 78.7, respectively).
This decline can be primarily attributed to the absence of the Dual Query process, which hampers the
encoding of additional Chain-of-Thought (CoT) information.


Table 8: Experiments with Different Encoders and Ablation of LoRe Across Baselines on GSM8K. 


Method               BERT-base-uncased    RoBERTa-base
EPR                  77.3                 78.0
EPR + LoRe           77.0                 7178
CEIL                 79.4                 71.1
CEIL + LoRe          78.7                 71.4
DQ-LoRe w/o LoRe     78.9                 76.4
DQ-LoRe              80.7                 78.8


E EXPERIMENTS ON MORE LLMS 


In this section, we conducted experiments on a broader set of LLMs using the GSM8K dataset. The 
results are presented in the table below. 


Table 9: Experiments on more LLMs on GSM8K. 

Model          Llama2-7b-hf    Llama2-7b-chat-hf    Vicuna-7b
CoT            12.6            22.9                 19.4
Complex-CoT    17.7            29.4                 23.8
Auto-CoT       15.0            27.0                 23.9
EPR            15.1            23.0                 22.0
CEIL           15.2            26.7                 22.4
DQ-LoRe        16.0            28.9                 23.8


We observe that, compared to the retrieval baselines (EPR and CEIL), our method improves accuracy
even on smaller open-source models, by 0.8 to 2.2 points over CEIL, the stronger of the two.
Moreover, the effectiveness of our approach increases with the model’s instruction-following
capabilities. However, we also acknowledge that, compared to Complex-CoT and Auto-CoT, our
method does not show comparable improvements on the Llama-family 7B-scale models.


F ABLATION STUDY OF RE-RANKING 


In this section, we conduct a dedicated ablation study on the re-ranking operation. We also
evaluate several re-ranking methods in addition to PCA.


To begin, we compared the impact of various re-ranking methods. ‘w/o re-ranking’ serves as our
baseline, where no re-ranking is performed. ‘Longest’ is a re-ranking approach that extends
the Complex-CoT method: the 8 exemplars with the longest CoTs are selected for the final inference
step from the initial retrieval of 16 or 32 exemplars (the number of initially retrieved exemplars
is a tunable hyperparameter). In addition, we conducted experiments employing BERT-Whitening,
MGS (Modified Gram-Schmidt Process) (1981), and Gaussian Kernel (Schölkopf et al., 1997) for
re-ranking. The detailed experimental results are presented in the table below:

Table 10: Comparison of Different Inference LLMs and Ablation of Re-Ranking Algorithms on GSM8K.


Re-ranking method    Text-Davinci-003    GPT3.5-Turbo-16k
w/o re-ranking       67.9                78.9
Longest              68.1                78.9
BERT-Whitening       67.9                78.9
MGS                  68.2                79.7
Gaussian Kernel      66.7                79.0
PCA                  69.1                80.7


As illustrated in the table above, we establish a baseline by omitting the re-ranking step and
employing a single-round ranking model. We then analyze the impact of various algorithms for the
second-round ranking. Only two methods, MGS and PCA, exhibit a discernible impact on the re-ranking
process. This suggests that the effectiveness of re-ranking is attributable not only to
dimensionality reduction but also to the orthogonalization of vectors, both of which play
significant roles. In contrast, kernel tricks such as the Gaussian Kernel do not appear to
contribute significantly to sample distinction in this context. Notably, with text-davinci-003, the
Gaussian Kernel reduces performance by 1.2 points relative to the baseline (67.9 to 66.7).
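
Two of the alternatives compared above can be sketched briefly. The first function implements the ‘Longest’ selection, assuming CoT length is measured in characters. The second implements the Modified Gram-Schmidt orthogonalization itself; how the orthogonalized vectors are then scored for re-ranking is not described here, so that part is omitted. Both function names are hypothetical.

```python
import numpy as np


def longest_rerank(cots, top_n=8):
    """Indices of the top_n longest CoTs (length in characters, an assumption)."""
    return sorted(range(len(cots)), key=lambda i: len(cots[i]), reverse=True)[:top_n]


def modified_gram_schmidt(vectors, tol=1e-10):
    """Orthonormalize the rows of `vectors`, dropping near-dependent rows."""
    basis = []
    for v in np.asarray(vectors, dtype=float):
        w = v.copy()
        for q in basis:
            # MGS projects out each basis vector from the running residual w,
            # rather than from the original v as classical Gram-Schmidt does.
            w -= (q @ w) * q
        norm = np.linalg.norm(w)
        if norm > tol:
            basis.append(w / norm)
    return np.array(basis)
```

The rows returned by `modified_gram_schmidt` are mutually orthogonal unit vectors, which is the property the analysis above associates with better sample distinction.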


G CASE STUDY ON SVAMP 


In this section, we present exemplars retrieved for a single data point on the SVAMP dataset.
Tables 11 through 15 display the 8-shot results retrieved by EPR (Tables 11 and 12), CEIL
(Table 13), and DQ-LoRe (Tables 14 and 15).
It can be observed that EPR relies on the pattern of word co-occurrence for retrieval, while CEIL
retrieves some unrelated examples. DQ-LoRe, in contrast, not only retrieves exemplars with both
word co-occurrence and logical relevance but also identifies exemplars with logical relevance even
in the absence of word co-occurrence.
Table 11: EPR Case Study on SVAMP: Part 1 


Problem 

Question: Mary is baking a cake. The recipe calls for 6 cups of flour 8 cups of sugar and 7 cups of 
salt. She already put in 5 cups of flour.How many more cups of sugar than cups of salt does she need 
to add now? 

EPR Prompt 

Question: Mary is baking a cake. The recipe calls for 12 cups of flour 14 cups of sugar and 7 cups of 
salt. She already put in 2 cups of flour.How many more cups of flour than cups of salt does she need 
to add now? 

A: Let’s think step by step To find out how many more cups of flour than cups of salt Mary needs to 
add now, we first need to calculate the total amount of flour and salt needed for the recipe. The recipe 
calls for 12 cups of flour and Mary has already added 2 cups, so she still needs 12 - 2 = 10 cups of 
flour. The recipe also calls for 7 cups of salt and Mary has not added any yet, so she needs 7 cups of 
salt. Therefore, Mary needs to add 10 - 7 = 3 more cups of flour than cups of salt. The answer is 3. 
Question: Mary is baking a cake. The recipe calls for 9 cups of sugar 7 cups of flour and 4 cups of 
salt. She already put in 2 cups of flour.How many more cups of flour than cups of salt does she need 
to add now? 

A: Let’s think step by step Mary needs to add 7 - 2 = 5 more cups of flour. She needs to add 5 - 4 = 1 
more cup of flour than cups of salt. Answer: | 1 |. The answer is 1. 

Question: Mary is baking a cake. The recipe calls for 14 cups of flour and 6 cups of sugar. She 
already put in 5 cups of flour.How many more cups of flour than cups of sugar does she need to add 
now? 

A: Let’s think step by step Mary needs to add 14 - 5 = 9 more cups of flour. She needs to add 9 - 6 = 
3 more cups of flour than cups of sugar. Answer: | 3 |. The answer is 3. 

Question: Mary is baking a cake. The recipe calls for 11 cups of flour and 7 cups of sugar. She 
already put in some cups of flour. If she still needs 2 more cups of flour than sugarHow many cups of 
flour did she put in? 

A: Let’s think step by step. Let’s assume Mary already put x cups of flour. According to the 
information given, she still needs 2 more cups of flour than sugar, so she needs (7 + 2) = 9 cups 
of sugar. Since the recipe calls for 11 cups of flour, the total amount of flour needed is 11 cups. 
Therefore, the equation is x + 9 = 11. Subtracting 9 from both sides of the equation, we get x = 11 - 9 
= 2. Mary has already put in 2 cups of flour. The answer is 2. 

Question: Mary is baking a cake. The recipe calls for 11 cups of sugar and some more cups of flour. 
She already put in 3 cups of flour. If she still needs to add 6 more cups of flourHow many cups of 
flour did the recipe require? 

A: Let’s think step by step: Mary already put in 3 cups of flour, and she still needs to add 6 more 
cups of flour. So, the total amount of flour she needs to add is 3 + 6 = 9 cups. Therefore, the recipe 
requires 9 cups of flour. The answer is 9. 

Question: Mary is baking a cake. The recipe calls for 3 cups of sugar 10 cups of flour and 15 cups of 
salt. She already put in 6 cups of flour.How many more cups of flour does she need to add? 

A: Let’s think step by step To find out how many more cups of flour Mary needs to add, we need 
to subtract the amount of flour she already put in from the total amount of flour required. The total 
amount of flour required is 10 cups, and Mary already put in 6 cups. Therefore, she needs to add 10 - 
6 =4 more cups of flour. Mary needs to add 4 cups of flour. The answer is 4. 
Table 12: EPR Case Study on SVAMP: Part 2 
Problem Question: Mary is baking a cake. The recipe calls for 9 cups of flour and 11 cups of sugar. 
She already put in 4 cups of flour. How many more cups of sugar than cups of flour does she need to 
add now? 
A: Let’s think step by step Mary needs to add 9 - 4 = 5 more cups of flour. Mary needs to add 11 - 5 
= 6 more cups of sugar than cups of flour. Answer: | 6 |. The answer is 6. 

Question: Mary is baking a cake. The recipe calls for 12 cups of sugar and 14 cups of flour. She 
already put in 10 cups of sugar.How many more cups of flour than cups of sugar does she need to add 
now? 

A: Let’s think step by step The recipe calls for 12 cups of sugar and she already put in 10 cups, so 
she needs to add 12 - 10 = 2 more cups of sugar. The recipe calls for 14 cups of flour and she hasn’t 
added any yet, so she needs to add 14 cups of flour. To find the difference between the cups of flour 
and cups of sugar, we subtract the cups of sugar from the cups of flour: 14 - 2 = 12. She needs to add 
12 more cups of flour than cups of sugar. The answer is 12. 
Table 13: CEIL Case Study on SVAMP 


Problem 

Question: Mary is baking a cake. The recipe calls for 6 cups of flour 8 cups of sugar and 7 cups of 
salt. She already put in 5 cups of flour.How many more cups of sugar than cups of salt does she need 
to add now? 

CEIL Prompt 

Question: A mailman has to give 4 pieces of junk mail to each house in each of the 81 blocks. If 
there are 12 houses in each blockHow many pieces of junk mail should he give in each block? 

A: Let’s think step by step There are 12 houses in each block, and the mailman has to give 4 pieces of 
junk mail to each house. So, in each block, the mailman should give 12 houses x 4 pieces of junk 
mail = 48 pieces of junk mail. The answer is 48. 

Question: After eating a hearty meal they went to see the Buckingham palace. There, Rachel learned 
that 132 visitors came to the Buckingham palace that day. If 406 people visited the Buckingham 
palace within the past 327 daysHow many visitors visited the Buckingham palace on the previous 
day? 

A: Let’s think step by step: To find out how many visitors visited the Buckingham palace on the 
previous day, we need to subtract the number of visitors on the current day from the total number 
of visitors in the past 327 days. So, the number of visitors on the previous day is 406 - 132 = 274. 
Therefore, 274 visitors visited the Buckingham palace on the previous day. The answer is 274. 
Question: There are a total of 16 peaches in a basket. If there are 13 red peaches and some more 
green peaches in the basket.How many green peaches are in the basket? 

A: Let’s think step by step If there are a total of 16 peaches in the basket and 13 of them are red, then 
the remaining peaches must be green. So, the number of green peaches in the basket is 16 - 13 = 3. 
The answer is 3. 

Question: If each bag has 41 cookies and you had 53 bags of cookiesHow many cookies would you 
have? 

A: Let’s think step by step If each bag has 41 cookies, and you have 53 bags, then the total number of 
cookies would be 41 x 53 = 2173 cookies. The answer is 2173. 

Question: The Ferris wheel in paradise park has 2 small seats and 23 large seats. Each small seat 
can hold 14 people and large seat can hold 54 people.How many people can ride the Ferris wheel on 
small seats? 

A: Let’s think step by step The number of people that can ride on the small seats is equal to the 
number of small seats multiplied by the capacity of each small seat. There are 2 small seats and each 
small seat can hold 14 people, so the total number of people that can ride on the small seats is 2 x 14 
= 28. The answer is 28. 

Question: Frank was reading through his favorite book. The book had 41 chapters, each with the 
same number of pages. It has a total of 450 pages. It took Frank 30 days to finish the book.How many 
pages did he read per day? 

A: Let’s think step by step. If the book has a total of 450 pages and 41 chapters, then each chapter 
has 450/41 = 10.975609756097561 pages (rounded to the nearest whole number). Since each chapter 
has the same number of pages, Frank read approximately 11 pages per chapter. If Frank finished the 
book in 30 days, then he read 450/30 = 15 pages per day. Therefore, Frank read approximately 15 
pages per day. The answer is 15. 

Question: Julia played tag with 16 kids on monday. If she played tag with 12 more kids on monday 
than on tuesdayHow many kids did she play with on tuesday? 

A: Let’s think step by step: If Julia played with 16 kids on Monday and 12 more kids than on Tuesday, 
then the number of kids she played with on Tuesday is 16 - 12 = 4. Therefore, Julia played with 4 
kids on Tuesday. The answer is 4. 

Question: Lewis earns $2 every week during the harvest. If he earns a total of $ 178How many weeks 
did the harvest last? 

A: Let’s think step by step: If Lewis earns $2 every week during the harvest, and he earns a total of 
$178, we can divide the total earnings by the amount he earns per week to find the number of weeks. 
$178 / $2 = 89 weeks Therefore, the harvest lasted for 89 weeks. The answer is 89. 
Table 14: DQ-LoRe Case Study on SVAMP: Part 1 


Problem 

Question: Mary is baking a cake. The recipe calls for 6 cups of flour 8 cups of sugar and 7 cups of 
salt. She already put in 5 cups of flour.How many more cups of sugar than cups of salt does she need 
to add now? 

DQ-LoRe (ours) 

Question: Mary is baking a cake. The recipe calls for 5 cups of sugar and 14 cups of flour. She 
already put in 11 cups of flour.How many more cups of sugar than cups of flour does she need to add 
now? 

A: Let’s think step by step. The recipe requires 5 cups of sugar and 14 cups of flour.She already 
added 11 cups of flour, so she still needs to add 14 - 11 = 3 cups of flour.For the sugar, she needs 5 
cups and she hasn’t added any yet, so she needs to add all 5 cups.Therefore, she needs to add 5 - 3 = 
2 more cups of sugar than cups of flour. The answer is 2. 

Question: Mary is baking a cake. The recipe calls for 9 cups of sugar 7 cups of flour and 4 cups of 
salt. She already put in 2 cups of flour.How many more cups of flour than cups of salt does she need 
to add now? 

A: Let’s think step by step. Mary needs to add 7 - 2 = 5 more cups of flour. She needs to add 5 - 4 = 1 
more cup of flour than cups of salt. Answer: | 1 |. The answer is 1. 

Question: Mary is baking a cake. The recipe calls for 11 cups of sugar and some more cups of flour. 
She already put in 3 cups of flour. If she still needs to add 6 more cups of flour.How many cups of 
flour did the recipe require? 

A: Let’s think step by step. Mary already put in 3 cups of flour, and she still needs to add 6 more 
cups of flour. So, the total amount of flour she needs to add is 3 + 6 = «3+6=9»9 cups. Therefore, the 
recipe requires 9 cups of flour. The answer is 9. 

Question: Mary is baking a cake. The recipe calls for 6 cups of sugar and 14 cups of flour. She 
already put in 7 cups of flour and 60 cups of sugar.How many more cups of flour does she need to 
add? 

A: Let’s think step by step. To find out how many more cups of flour Mary needs to add, we need to 
subtract the amount of flour she already put in from the total amount required. Total cups of flour 
required = 14 cups Flour already put in = 7 cups To find the remaining amount of flour needed, we 
do: Remaining flour needed = Total flour required - Flour already put in Remaining flour needed = 14 
cups - 7 cups Remaining flour needed = 7 cups Mary needs to add 7 more cups of flour. The answer 
is 7. 

Question: Helen the hippo and her friends are preparing for thanksgiving at Helen’s house. Helen 
baked 519 chocolate chip cookies and 300 raisin cookies yesterday. And she baked 280 raisin cookies 
and 359 chocolate chip cookies this morning. How many more raisin cookies did Helen bake yesterday 
compared to today? 

A: Let’s think step by step. To find out how many more raisin cookies Helen baked yesterday 
compared to today, we need to subtract the number of raisin cookies she baked today from the number 
she baked yesterday. Yesterday, Helen baked 519 chocolate chip cookies and 300 raisin cookies, for 
a total of 519 + 300 = 819 cookies. Today, Helen baked 280 raisin cookies and 359 chocolate chip 
cookies, for a total of 280 + 359 = 639 cookies. To find the difference in the number of raisin cookies, 
we subtract the number of raisin cookies baked today from the number baked yesterday: 300 - 280 = 
20. Therefore, Helen baked 20 more raisin cookies yesterday compared to today. The answer is 20. 
Question: Mary is baking a cake. The recipe calls for 5 cups of flour and 14 cups of sugar. She 
already put in 10 cups of flour and 2 cups of sugar.How many more cups of sugar does she need to 
add? 

A: Let’s think step by step. To find out how many more cups of sugar Mary needs to add, we need to 
subtract the amount of sugar she already put in from the total amount required. The recipe calls for 
14 cups of sugar and Mary already put in 2 cups, so she still needs to add 14 - 2 = 12 more cups of 
sugar. Therefore, Mary needs to add 12 more cups of sugar. The answer is 12. 

The answer is 2. 
Table 15: DQ-LoRe Case Study on SVAMP: Part 2 
Problem Question: Matthew gave equal numbers of crackers and cakes to his 4 friends. If he had 32 
crackers and 98 cakes initially. How many crackers did each person eat? 
A: Let’s think step by step. If Matthew gave equal numbers of crackers and cakes to his 4 friends, 
then each friend received the same number of crackers and cakes. Let’s assume that each friend 
received x number of crackers and x number of cakes. Since Matthew initially had 32 crackers and he 
gave an equal number to each friend, the total number of crackers given to his friends is 4x. Similarly, 
since Matthew initially had 98 cakes and he gave an equal number to each friend, the total number of 
cakes given to his friends is also 4x. We can set up the equation: 4x = 32 Solving for x, we divide 
both sides of the equation by 4: x = 32/4 = 8 Therefore, each person ate 8 crackers. The answer is 8. 
Question: Rachel had to complete 5 pages of math homework and yet more pages of reading 
homework. If she had to complete a total of 7 pages of homework. How many pages of reading 
homework did she have to complete? 
A: Let’s think step by step. Rachel had to complete 5 pages of math homework and yet more pages 
of reading homework. If she had to complete a total of 7 pages of homework, then the number of 
reading homework pages can be found by subtracting the number of math homework pages from the 
total number of homework pages. So, Rachel had to complete 7 - 5 = 2 pages of reading homework. 
Therefore, Rachel had to complete 2 pages of reading homework. 